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Abstract 

Background: The functions of proteins are closely related to their subcellular locations. In the post-genomics era, 
the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular 
localization by computational means. 

Results: This paper proposes mitigating the computation burden of alignment-based approaches to subcellular 
localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the 
informative segments of protein sequences are identified by a cleavage site predictor using the information in 
their N-terminal shorting signals. Then, the sequences are truncated at the cleavage site positions, and the 
shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localization are subsequently 
predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training 
and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on the 
perturbational discriminant analysis (PDA). 

Conclusions: Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can 
make use of the best property of signal- and homology-based approaches and can attain an accuracy comparable 
to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggest that both 
profile creation time and profile alignment time can be reduced without significant reduction in subcellular 
localization accuracy. It was found that PDA enjoys a short training time as compared to the conventional SVM. We 
advocate that the method will be important for biologists to conduct large-scale protein annotation or for 
bioinformaticians to perform preliminary investigations on new algorithms that involve pairwise alignments. 
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Background 

Motivation of subcellular localization prediction 

For a protein to function properly, it must be trans- 
ported to the correct organelles of a cell and folded into 
correct 3-D structures. Therefore, knowing the subcellu- 
lar localization of a protein is one step towards under- 
standing its functions. However, the determination of 
subcellular localization by experimental means is often 
time-consuming and laborious. Given the large number 
of un-annotated sequences from genome projects, it is 
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imperative to develop efficient and reliable computation 
techniques for annotating biological sequences. 

In recent years, impressive progress has been made in 
the computational prediction of subcellular localization. 
A number of approaches have also been proposed in the 
literature. These methods can be generally divided into 
four categories, including predictions based on sorting 
signals [1-6], global sequence properties [7-10], homol- 
ogy [11-13] and other information in addition to 
sequences [14,15]. Methods based on sorting signals are 
very fast, but they typically suffer from low prediction 
accuracy. Homology-based methods are more accurate, 
but they are very slow. Therefore, fast and reliable pre- 
dictions of subcellular localization still remain a 
challenge. 
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Approaches to subcellular localization prediction 

Signal-based methods predict the localization via the 
recognition of N-terminal sorting signals in amino acid 
sequences. PSORT, proposed by Nakai in 1991 [2], is 
one of the early predictors that use sorting signals for 
protein's subcellular localization. PSORT and its exten- 
sions - WoLF PSORT [3,4] - derive features such as 
amino acid compositions and the presence of sequence 
motifs for localization prediction. In the late 90's, 
researchers started to investigate the application of 
neural networks [16] to recognize the sorting signals. In 
a neural network, patterns are presented to the input 
layer of artificial neurons, with each neuron implement- 
ing a nonlinear function of the weighted sum of the 
inputs. Because amino acid sequences are of variable 
length, the input to the neural network is extracted 
from a short window sliding over the amino acid 
sequence. TargetP [17,18] is a well-known predictor that 
uses neural networks. 

Another type of approaches relies on the fact that pro- 
teins of different organelles have different global proper- 
ties such as amino-acid composition. Based on amino- 
acid composition and residue-pair frequencies, Naka- 
shima and Nishikawa [10] developed a predictor that 
can discriminate between soluble intracellular and extra- 
cellular proteins. Another popular predictor based on 
amino acid composition is SubLoc [7]. In SubLoc, a 
query sequence is converted to 20-dim amino-acid com- 
position vector for classification by support vector 
machines (SVMs). Recently, Xu et al. [19] proposed a 
semi-supervised learning technique (a kind of transduc- 
tive learning) that makes use of unlabelled test data to 
boost the classification performance of SVMs. One lim- 
itation of composition-based methods is that informa- 
tion about the sequence order is not easy to represent. 
Some authors proposed using amino-acid pair composi- 
tions (dipeptide) [8, 9, 20] and pseudo amino-acid com- 
positions [21] to enrich the representation power of the 
extracted vectors. 

The homology-based methods use the query sequence 
to search protein databases for homologs [11,12] and 
predict the subcellular location of the query sequence as 
the one to which the homologs belong. This kind of 
method can achieve very high accuracy when homologs 
of experimentally verified sequences can be found in the 
database search [22] . A number of homology-based pre- 
dictors have been proposed. For example, Proteome 
Analyst [23] uses the presence or absence of the tokens 
from certain fields of the homologous sequences in the 
Swiss-Prot database as a means to compute features for 
classification. In Kim et al. [24], an unknown protein 
sequence is aligned with every training sequences (with 
known subcellular locations) to create a feature vector 
for classification. Mak et al. [13] proposed a predictor 



called PairProSVM that uses profile alignment to detect 
weak similarity between protein sequences. Given a 
query sequence, a profile is obtained from PSI-BLAST 
search [25]. The profile is then aligned with every train- 
ing profile to form a score vector for classification by 
SVMs. 

Some predictors not only use amino acid sequences as 
input but also require extra information such as lexical 
context in database entries [14] or Gene Ontology 
entries [15] as input. Although studies have shown that 
this type of method can outperform sequence-based 
methods, the performance has only been measured on 
data sets where all sequences have the required addi- 
tional information. 

Limitations of existing approaches 

Among all the methods mentioned above, the signal- 
based and homology-based methods have attracted a 
great deal of attention, primarily because of their bio- 
logical plausibility and robustness in predicting newly 
discovered sequences. Comparing these two 
approaches, the signal-based methods seem to be more 
direct, because they determine the localization from 
the sequence segments that contain the localization 
information. However, this type of method is typically 
limited to the prediction of a few subcellular locations 
only. For example, the popular TargetP [5,6] can only 
detect three localizations: chloroplast, mitochondria, 
and secretory pathway signal peptide. The homology- 
based methods, on the other hands, can in theory pre- 
dict as many localizations as available in the training 
data. The downside, however, is that the whole 
sequence is used for the homology search or pairwise 
alignment, without considering the fact that some seg- 
ments of the sequence are more important or contain 
more information than the others. Moreover, the com- 
putation requirement will be excessive for long 
sequences. The problem will become intractable for 
database annotation where tens of thousands of pro- 
teins are involved. 

Our proposal for addressing the limitations 

Our earlier report [26] has demonstrated that computa- 
tion time of subcellular localization based on profile 
alignment SVMs can be substantially reduced by align- 
ing profiles up to the cleavage site positions of signal 
peptides, mitochondrial targeting peptides, and chloro- 
plast transit peptides. Although 20-fold reduction in 
total computation time (including alignment, training 
and recognition time) has been achieved, the method 
fails to reduce the profile creation time, which will 
become a substantial part of the total computation time 
when the database becomes large. In this paper, we pro- 
pose a new approach that can reduce both the profile 
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creation time and profile alignment time. In the new 
approach, instead of cutting the profiles, we shorten the 
sequences by cutting them at the cleavage site locations. 
The shortened sequences are then presented to PSI- 
BLAST to compute the profiles. To further reduce the 
training and recognition time of the classifier, we pro- 
pose replacing the SVMs by kernel perturbation 
discriminants. 

Fusion of signal- and homology-based methods 

Fig. 1 shows the histograms of the length of signal pep- 
tides (SP), mitochondrial transit peptides (mTP), and 
chloroplast transit peptides (cTP). The length is the 
number of amino acids from the N-terminus up to the 
cleavage site. It is obvious that the lengths of these pep- 
tides are rather short. Given the fact that the majority of 
proteins in the Swiss-Prot database have about a few 
hundred amino acids and that some proteins could have 
length longer than 5,000 amino acids, tremendous com- 
putational saving can be achieved by combining the sig- 
nal-based and homology-based methods described 
below. 

Truncation of profiles/sequences 

We have investigated two fusion schemes (see Fig. 2): 

I: Truncating Profiles, Given a query sequence, we pass 
it to PSI-BLAST [25] to determine a full-length profile 
(PSSM and PSFM [13]). The profile is then truncated at 
the cleavage site position. The truncated profile is 
aligned with each of the training profiles to create a vec- 
tor for classification. Note that the training profiles are 
also created by the same procedure. 

II: Truncating Sequences, Given a query sequence, we 
truncate it at the cleavage site and pass the truncated 
sequence to PSI-BLAST to determine a short-length 
profile. The profile is then aligned with all of the train- 
ing profiles to create a vector for classification. All train- 
ing profiles are also created by the same procedure. 



Note that as the time taken by PSI-BLAST search 
(profile-creation time) is proportional to the query 
sequence, Scheme II is expected to provide more com- 
putation saving than Scheme I. However, as the 
sequences are truncated at an early stage, important 
information may be lost if cleavage site prediction is 
inaccurate. The "Results and Discussion" Section pro- 
vides experimental evidences suggesting that Scheme II 
can provide significant computation saving without suf- 
fering from severe information loss. 

Cleavage site prediction 

This work investigated two cleavage site predictors: con- 
ditional random fields (CRFs) [27,28] and TargetP [5,6]. 
CRFs [29] were originally designed for sequence label- 
ling tasks such as Part-of-Speech (POS) tagging. Given a 
sequence of observations, a CRF finds the most likely 
label for each of the observations. To use CRFs for clea- 
vage site prediction, amino acid sequences are treated as 
observations and each amino acid in the sequences is 
labelled as either Signal, Cleavage, or Mature, e.g., 
SSSSSSCMMMMMM, as illustrated in Fig. 3. The clea- 
vage site is located at the transition between C and M. 
Amino acids of similar properties can be categorized 
according to their hydrophobicity and charge/polarity as 
shown in Table 1. These properties are used because 
the h-region of signal peptides is rich in hydrophobic 
residues and the c-region is dominated by small, non- 
polar residues [30]. Moreover, as illustrated in Fig. 4, 
the degree of hydrophobicity is also very different at dif- 
ferent positions, making this feature useful for the label- 
ling task. 

TargetP is one of the most popular signal-based sub- 
cellular localization predictors and cleavage site predic- 
tors. Given a query sequence, TargetP can determine its 
subcellular localization and will also invoke SignalP [31], 
ChloroP [32], or a program specialized for mTP to 
determine the cleavage site of the sequence. TargetP 




(a) (b) (c) 



Figure 1 Length distribution of SP, mTP, and cTP. The histograms of length of (a) secretory pathway signal peptides, (b) mitochondrial 

targeting peptides, and (c) chloroplast transit peptides. The length is the number of amino acids from the N-terminus up to the cleavage site. 

Vertical axes: number of occurrences. Horizontal axes: sequence length. 
I J 
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Figure 2 Two schemes for computation saving. Two schemes for reducing the computation of the subcellular localization process. In 
Scheme I, a full-length query sequence is presented to PSI-BLAST for computing a full-length profile; then the profile is truncated at the 
predicted cleavage site. The truncated profile is then aligned with all of the truncated training profiles to produce a profile-alignment score 
vector for classification. In Scheme II, the query sequence is truncated at the predicted cleavage site before inputting to PSI-BLAST for 
computing the profile. The cleavage sites are predicted by CSitePred [27] or TargetP [5]. 



requires the N-terminal sequence of a protein as input. 
During prediction, a sliding window scans over a query 
sequence; for each segment within the window, a 
numerically encoded vector is presented to a neural net- 
work to compute the segment score. The cleavage site is 
determined by finding the position at which the score is 
maximum. The cleavage site prediction accuracy of Sig- 
nalP on Eukaryotic proteins is around 70% [33] and that 
of ChloroP on cTP is 60% (±2 residues) [32]. 

Methods 

Data preparation 

Protein sequences with experimentally annotated subcel- 
lular locations were extracted from the Swiss-Prot 
Release 57.5 according to the following criteria. 

1. Only the entries of Eukaryotic species, which were 
annotated with "Eukaryota" in the OC (Organism Classi- 
fication) fields in Swiss-Prot, were included. 

2. Entries annotated with ambiguous words, such as 
"probable", "by similarity" and "potential", were excluded 
because of the lack of experimental evidence. 

3. Sequences annotated with "fragment" were 
excluded. 

4. For signal peptides, mitochondria, and chloroplast, 
only sequences with experimentally annotated cleavage 
sites were included. 



The extracted sequences were then filtered by BLAST- 
Clust [34] so that the resulting sequences have sequence 
identity less than 25%. Table 2 shows the breakdown of 
the dataset. A modified version of the Perl scripts pro- 
vided by [35] was used for creating the dataset. 

PDA and SVM for multi-class classification 

We used perturbational discriminant analysis (PDA) [36] 
and support vector machines (SVMs) [37] for classifica- 
tion. The formulation of PDA can be found in the 
Appendix. During the training phase, N training profiles 
were obtained by Scheme I or Scheme II. Pair-wise pro- 
file-alignments were then performed to create an N x N 
symmetric score matrix K, which were then used to 
train the PDA and SVM classifiers as follows. 
One-vs-rest PDA and SVM classifier 

A C-class problem can be formulated as C binary classi- 
fication problems in which each problem is solved by a 
binary classifier. Given the training sequences of C 
classes, we trained C PDA score functions: 

f i {x) = ajk{x) + b i , i = l,...,C, (1) 

where x is a query sequence, 
k[x) = [K{x lr x) r ... f K(x Nr x)] T contains the similarity 
(via profile alignment) between x and the N training 
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Figure 3 CRFs for cleavage site prediction. Conditional random fields (CRFs) for cleavage cite prediction. Given a sequence of observations, 
each amino acid in the sequences is labelled as either "Signal", "Cleavage", or "Mature", e.g., SSSSSSCMMMMMM. The cleavage site is located at 
the transition between C and M. 
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Table 1 Grouping of amino acids according to their 
hydrophobicity and charge/polarity [43]. 



Property 


Group 




Hydrophobicity 


H1={D,E,N,Q,R,K} 

H2={C,S,T,P,G,H,Y} 

H3={A,M,I,L,V,F,W} 




Charge/Polarity 


C1={R,K,H} 
C2={D,E} 

C3={C,T,S,G ; N,Q ; Y} 
C4={A,P,M,L,I,V,F,W} 







3.5 



3 - 




Hydrophobicity -5 Position 



(b) 

Figure 4 The mean (top) and the histograms (bottom) of 
hydrophobicity of 179 signal peptides at different sequence 
positions. The cleavage site of these sequences is between Positions 
19 and 20. 

V J 



Table 2 Breakdown of eukaryotic dataset derived from 
the Swiss-Prot database (release 57.5). 



Class Index 


Subcellular Location 


Number of Proteins 


1 


Extracellular 


693 


2 


Mitochondria 


167 


3 


Chloroplast 


74 


4 


Others(Cytoplasm/Nucleus) 


1617 


2552(total) 



profiles, and a t and b t were obtained by Eq. 11 and Eq. 
12 in the Appendix. 

For the SVM classifier, the score functions in Eq. 1 are 
replaced by the linear SVM score functions: 

N 

fi(x) = X ( ^ j) ) + b 'f i = l " ' " C > 

where a^s are the Lagrange multipliers of Class i, and 
jij = 1 if Xj belongs to Class i and y t j = -1 otherwise. 
Then, given a test sequence x, the class label is given by 

c 

/ = arg max /,-(#). 



Cascaded fusion of PDA and SVM 

Instead of using Eqs. 11 and 12, the optimal weights in 
PDA can also be equivalently expressed in terms of d 
and 77 in Eqs. 8 and 9. In a C-class problem, the i-th. 
class will have its corresponding d t and r\ b where i = 1, 
...,C. However, because of the dependence in d b the 
rank of matrix [d lt d c ] is C - 1. Therefore, there are 
C - 1 independent sets of PDA parameters: 

A = [a lf ...,a c _ 1 ] 

= {K + piy\[d l dcJ-lfc ric^l 

where 1 is an TV-dim vector of all l's and p is a pertur- 
bation parameter. During recognition, an unknown sam- 
ple x is projected onto a (C - l)-dim PDA space 
spanned by a C -\[ using 
g(x) = A T k(x) + [*!,..., Z ?C _ 1 ] T , g(x) e ^H c -\ 
Then, g(x) is classified by one-vs-rest RBF-SVMs. In 
the sequel, we refer to this cascaded fusion as PDAproj 
+ SVM. Fig. 5 exemplifies the capability of PDAproj 
+SVM using a 2-dim multi-class problem. 

Performance evaluation 

We used 5-fold cross validation to evaluate the perfor- 
mance. The overall prediction accuracy, the accuracy for 
each subcellular location, and the Matthew's correlation 
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Figure 5 Multi-class classification by perturbational discriminant analysis (PDA), (a) 2-dim data {x h ... r x N } of 3 different classes in the input space, 
where N = 300. (b) N x N RBF kernel matrix K, where k-,j = exp{- x,- - x } 2 /2}; each column of K is an /V-dim vector k(xi) in the empirical space K. 
(c) Projected data g(xl) = A J k{Xj) + b on the PDA space where data can be easily classified by 1-vs-rest SVMs. (d) Decision boundaries produced 
by PDAproj+SVM. 



coefficient (MCC) [38] were used to quantify the predic- 
tion performance. MCC allows us to overcome the 
shortcoming of accuracy on unbalanced data [38]. 

We measured the computation time on a Core 2 Duo 
3.16GHz CPU running Matlab and SVMlight. The com- 
putation time was divided into profile creation time, 
alignment time, classifier training time, and classification 
time. 

Results and discussion 

Performance of cleavage site prediction 

Table 3 shows the cleavage site prediction accuracy of 
TargetP and CSitePred [28] (a CRF-based predictor). It 
suggests that CSitePred is better than TargetP(P) in 
terms of predicting the cleavage sites of signal peptide 
(SP) but is poorer than TargetP(N). The results also 
suggest that while CSitePred is slightly inferior to Tar- 
getP in predicting the cleavage sites of mitochondria, it 
is significantly better than TargetP in predicting the 
cleavage sites of chloroplasts. Note that the overall 
accuracies depend heavily on the SP class because of the 
large number of signal peptides in the dataset (see Table 
2). 

Table 3 Cleavage-site prediction accuracies achieved by 
TargetP and CSitePred. For TargetP, (P) and (N) mean 
using the 'Plant' and 'Non-plant' option of the predictor, 
respectively. TargetP will invoke SignalP, ChloroP, or a 
program specialized in predicting mTP for cleavage site 
prediction. CSitePred is based on conditional random 
fields. 

Cleavage Site Predictor Cleavage Site Prediction Accuracy (%) 





SP 


mTP 


cTP 


Overall 


TargetP(P) 


71.49 


44.04 


8.82 


64.55 


TargetP(N) 


84.63 


46.69 


2.21 


75.28 


CSitePred 


79.40 


39.40 


31.62 


71.73 



The prediction accuracy of chloroplasts by TargetP 
shown in Table 3 is significantly lower than that in [32]. 
There are two reasons for this difference: (1) our dataset 
has sequence identity lower than that of [32] and (2) we 
consider predicting precisely the ground-truth sites as 
correct predictions whereas [32] considers predictions 
within ±2 positions of the ground-truth sites as correct 
predictions. In fact, if we relaxed the criterion of correct 
prediction to ±2 ground-truth positions, the prediction 
accuracy on chloroplasts achieved by TargetP increases 
to 47.06%. 

Sensitivity analysis 

To evaluate the effect of incorrect cleavage site predic- 
tion on the accuracy of subcellular localization, sensitiv- 
ity analysis was performed by truncating SP, mTP, and 
cTP at the ground-truth cleavage sites and plus/minus 
several positions of the ground-truths. Specifically, the 
sequence cut-off positions are 16, 8, and 2 amino acids 
upstream and 2, 16, 32, and 64 amino acids downstream 
from the ground-truth cleavage site. 

Fig. 6 shows that the overall accuracy of subcellular 
localization does not rely significantly on the precision 
of cleavage site prediction as long as the predicted sites 
are not too far away from the ground-truths. 

Apparently, mTP and cTP are more sensitive to the 
error of cleavage site prediction, which agrees with the 
fact that the signals of mTP and cTP are weaker. Locali- 
zation performance of these sequences degrades when 
the cut-off position drifts away significantly the ground- 
truth cleavage site. But the overall accuracy can be 
maintained at above 95% even if the drift is as large as 
-16 and +64 positions from the ground-truth. More- 
over, a forward drift of 64 positions from the ground 
truth cleavage site leads to a higher overall accuracy 
when compared to that of a backward drift of 16 posi- 
tions, which suggests that cutting sequences before their 
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Figure 6 Sensitivity of subcellular localization accuracy with respect 
to the (top) profile cut-off positions in Scheme I and (bottom) 
sequence cut-off positions in Scheme II (see Fig. 2). p is the ground- 
truth cleavage site. For "Cyt/Nuc" proteins, p is set to 170. 



cleavage sites may lose useful information in the signal 
pep-tides while including extra (may be irrelevant) infor- 
mation by cutting sequences after their cleavage sites is 
not detrimental to subcellular location accuracy. 

Profile-creation time 

Fig. 7 shows the score matrices obtained by the two 
profile creation schemes illustrated in Fig. 2. The figure 
shows that the two alignment score matrices exhibit a 
similar pattern, suggesting that classifiers based on these 
matrices will produce similar classification accuracy. 
This argument is confirmed by Table 4, which shows 



J 



i 



(a) (b) 

Figure 7 Profile-alignment score matrices produced by (left) 
Scheme I and (right) Scheme II in Fig. 2. 



that cutting the sequences at cleavage sites before input- 
ting to PSI-BLAST can reduce the profile creation time 
by 6 times without significant reduction in subcellular 
localization accuracy. 

Profile-alignment time 

Table 5 shows that the computation time for full-length 
profile alignment is striking — nearly thirty-five seconds 
per sequence, which suggests that full-length alignment 
is computationally prohibitive. Therefore, it is impera- 
tive to limit the length of the sequences or profiles 
before alignment. Table 5 also shows that truncating the 
sequences at their cleavage site positions leads to nearly 
a 20 folds reduction in alignment time without suffering 
from loss in subcellular localization performance. This is 
because the signal segment can be found in the N-ter- 
minus, and removing the amino acids beyond the clea- 
vage site helps the alignment focuses on the relevant 
features in the profiles and disregard noise. 

SVM versus PDA 

Table 6 shows that the training time of PDA and PDA- 
proj+SVM are only one-fifth of that of SVM. However, 
the accuracy of PDA and PDAproj+SVM are lower than 
that of SVM. 

Compared with state-of-the-art predictors 

We compared the accuracy of the proposed fusion of 
signal-based and homology-based methods with SubLoc 
[7], TargetP [5] and PairProSVM [13]. Table 7 shows 
that the overall accuracy of the proposed method (the 
5th row) is 5.2% higher than that of TargetP (3rd row) 
and is significantly better than that of SubLoc (1st row). 
Our method outperforms TargetP in Ext (SP) and Cyt/ 
Nuc prediction while performing worse than TargetP in 
predicting Mit and Chi. One limitation of TargetP is 
that users need to select either "Plant" or "Non-plant". 
If the former is selected, the performance of Ext and 
Cyt/Nuc degrade significantly, leading to a low overall 
accuracy; if the latter is selected, none of the chloroplast 
proteins can be correctly predicted. The cascaded fusion 
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Table 4 Average computation time to create a profile by PSI-BLAST using sequences of different length as input. In 
Scheme I, full-length sequences are presented to PSI-BLAST and the resulting profiles are truncated at the predicted 
cleavage sites. In Scheme II, truncation is applied to the sequences before presenting to PSI-BLAST. In both cases, 
CRFs (CSitePred) were used to predict the cleavage sites. 

Scheme Input to PSI-BLAST Profile Creation Time (second) Subcellular Localization Accuracy 

I Full-length sequences 30.5 91.69% 

II Sequences truncated at predicted cleavage sites 4.7 91.45% 



of cleavage site prediction and PairProSVM, on the 
other hand, can classify all four classes with fairly high 
accuracy, leading to a higher overall accuracy. 

The prediction accuracy and MCC of the proposed 
methods (Rows 4-10 in Table 7) are comparable to 
Pair-ProSVM (Row 4 in Table 7). The main improve- 
ment is on computation time reduction. 

Because ChloroP is weak in predicting the cleavage 
sites of chloroplasts (see Table 3), it is not a good candi- 
date for assisting PairProSVM. This is evident by the 
low subcellular localization accuracy of chloroplasts in 
Table 7 when TargetP is used as a cleavage site predic- 
tor. However, TargetP is fairly good at predicting the 
subcellular location of chloroplasts when it is used as a 
localization predictor. 

Among the four classes in Table 7, the subcellular 
localization accuracies of mitochondria and chloroplasts 
are generally lower than that of Ext and Cyt/Nuc. The 
reason may be that these transit peptides are less well 
characterized and their motifs are less conserved than 
those of secretary signal peptides [6]. 

Table 7 also suggests that the TargetP (N) is very effec- 
tive in assisting PairProSVM, leading to the highest pre- 
diction accuracy (92.6%) among all subcellular 
localization predictors. In particular, except for predict- 
ing Chi, TargetP in combination with PairProSVM can 
surpass the other methods in subcellular localization 
accuracy and MCC. 

Conclusions 

This paper has demonstrated that homology-based sub- 
cellular localization can be speeded up by reducing the 
length of the query amino acid sequences. Because 



shortening an amino acid sequence will inevitably throw 
away some information in the sequence, it is imperative 
to determine the best truncation positions. This paper 
shows that these positions can be determined by clea- 
vage site predictors such as TargetP and CSitePred. The 
paper also shows that as far as localization accuracy is 
concerned, it does not matter whether we truncate the 
sequences or truncate the profiles. However, truncating 
the sequence has computation advantage because this 
strategy can save the profile creation time by as much 
as 6 folds. 

Appendix: kernel discriminant analysis 

This appendix derives the formulations of kernel discri- 
minant analysis. The key idea lies in the equivalency 
between the optimal projection vectors in the Hilbert 
space, spectral space and empirical space. 

Input, Hilbert, spectral, and empirical Spaces 

Denote the mapping from an input space X into a Hil- 
bert space H as: 

6 : X -> H such that x &(x). 

In bioinformatics, X is a vectorial space for microarray 
data and a sequence space for DNA or protein 
sequences. Given a training dataset {xi,..., x N } in X and 
a kernel function K(x, y), an object can be represented 
by a vector of similarity with respect to all of the train- 
ing objects [39]: 

k{x) = [K(x l ,x),...,K(x N ,x)] T . 



Table 5 Profile-Alignment time and subcellular localization accuracy for different sequence cut-off positions in Scheme 
II. In the first column, "Full length" means that no sequence truncation was applied. "TargetP(P)" and "Tar-getP(IM)" 
mean that the cutoff position is determined by TargetP using the "Plant" option and "Non-plant" option, respectively. 
CSitePred is a cleavage site predictor based on conditional random fields. 



Seq. Cutoff position Alignment Time for Each Profile (sec.) Subcellular Localization Accuracy (%) 



Full length 


34.7 


91.64 


170 


4.7 


90.98 


Ground-truth 


1.9 


98.31 


Determined by TargetP(P) 


1.8 


89.08 


Determined by TargetP(N) 


1.7 


93.14 


Determined by CSitePred 


1.9 


91.45 
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Table 6 The computation time and performance of 
different classifiers in the subcellular localization task. 
The classification time is the time to classify a profile- 
alignment score vector with dimension equal to the 
number of training vectors. The training time is the time 
required to train a classifier, given a profile-alignment 
score matrix K. In PDAproj+SVM, PDA was applied to 
project the samples in the input space to a (C - 1)-dim 
space (C = 4 here); the projected vectors were then 



classified by RBF-SVMs. 



Classification 
Method 


Training Time 
(sec.) 


Classification Time 
(sec.) 


SubLoc 
Acc. 


SVM 


51.4 


0.7 


91.45% 


PDA 


9.9 


1.9 


90.24% 


PDAproj+SVM 


8.9 


0.1 


89.97% 



This Af-dim space, denoted by 7<T, is called empirical 
space. The associate kernel matrix is defined as 

K = [k(x 1 ),...,k(x N )]. 

The construction of the empirical space for vectorial 
and non-vectorial data are quite different. For the for- 
mer, the elements of K are a simple function of the cor- 
responding pair of vectors in X. For the latter, the 
elements in K are similarities between the correspond- 
ing pairs of objects. 

The kernel matrix K can be factorized with respect to 
the basis functions in H:K - O t O, where 
O = [0(x 1 ),..., 0{x N )] . Alternatively, it can be factorized 
via spectral decomposition: 

K = U T a 17 = 17 T a' a' U = (a '[/) t (a' U) = E T E, 
where £ _ A ijj . 

Denote the i-th row of E as e (i) [e (, 'W,...,e (, 'W)]. 
Because £E T = a 2 UU t a 2 = a tne rows °f E exhibit a 
vital orthogonality property: 



[A f ifi^j, 

where X t is the i-th. element of the diagonal of A. 

For any positive-definite kernel function K(x, y) and 
training dataset {%,... m X, there exists a (nonlinear) 
mapping from the original input space X to an Af-dim 
spectral space E: 

_i - 

e : X — > £ such that x e{x) = a 2 Uk(x). 

Note that K = E T E, i.e., 
E = {E T y\K = (L7 T A 2 y x K = a~ 2 UK • Therefore, 
e(Xi) = a"2 Ukfa), i = 1, . . . , N • 

Many kernel-based machine learning problems involve 
finding optimal projection vectors in H, E, and 7<T, which 
will be respectively denoted as w, v, and a. It can be 
shown [36] that the projection vectors are linearly 
related as follows: 

w T 6{x) = v T e{x) = a T k{x), (2) 

where we have used the relationships w - Oa and v - 
Ea. 

Orthogonal hyperplane principle (OHP) 

Assume that the dimension of H is M and that the 
training data in H are mass-centered. When M >N, all 
of the N training vectors {(0(x f ); i = 1,..., N} will fall on 
an (M -l)-dim data hyperplane. Mathematically, the 
data-hyperplane is represented by its normal vector p 
such that 0 T p = 1. The optimal decision-hyperplane in 
H (represented by w) must be orthogonal to the data- 
hyperplane: 
w Y p = 0 => a T 0 T p = 0 => oc T l = 0. 



Table 7 Subcellular localization performance achieved by different classifiers. The second column specifies the 
cleavage site predictors that were used for determining the positions at which the amino sequences were truncated. 
Notice that TargetP can perform both cleavage site prediction and subcellular localization. For Rows 4 and 5, TargetP 
was used as a cleavage site predictor, where "TargetP(P)" and "TargetP(N)" mean selecting plant or non-plant option 
in TargetP, respectively. For Rows 6-8 "CRF" means that conditional random fields were used for cleavage site 
prediction. 



Row Cleavage Site Predictor Localization Predictor Classification Accuracy (%) Matthew's correlation coefficient (MCC) 









Ext 


Mit 


Chi 


Cyt/Nuc 


Overall 


Ext 


Mit 


Chi 


Cyt/Nuc 


Overall 


1 




SubLoc [7] 


51.44 


55.83 




77.86 


66.79 












2 




TargetP (P) 


79.08 


88.02 


89.19 


69.57 


73.93 


0.79 


0.49 


0.79 


0.64 


0.65 


3 




TargetP (N) 


97.40 


89.22 


0.00 


87.82 


87.97 


0.93 


0.58 


0.00 


0.81 


0.84 


4 


TargetP(N) 


SVM 


97.26 


67.07 


36.49 


95.86 


92.63 


0.93 


0.70 


0.53 


0.86 


0.90 


5 


TargetP(N) 


PDA 


97.55 


61.68 


6.76 


95.61 


91.34 


0.91 


0.68 


0.26 


0.84 


0.88 


6 


TargetP(N) 


PDAproj+SVM 


97.26 


65.27 


37.84 


93.57 


91.10 


0.93 


0.64 


0.50 


0.83 


0.88 


7 


CRF 


SVM 


94.52 


63.47 


28.38 


95.86 


91.45 


0.90 


0.68 


0.45 


0.84 


0.89 


8 


CRF 


PDA 


94.81 


59.28 


1.35 


95.55 


90.24 


0.88 


0.67 


0.11 


0.82 


0.81 


9 


CRF 


PDAproj+SVM 


94.66 


63.47 


25.68 


93.63 


89.97 


0.90 


0.60 


0.41 


0.82 


0.87 
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Kernel Fisher discriminant analysis (KFDA) 

The objective of KFDA [40] is to determine an optimal 
discriminant function (linearly) expressed in the Hilbert 
space H: 

f(x) = w T 6(x) + b, 

where b is a bias to account for the fact that training 
data may not be mass-centered. The discriminant func- 
tion may be equivalently expressed in the Af-dim spec- 
tral space E: 

f{x) = v T e(x) + b. 

The finite-dimensional space E facilitates our analysis 
and design of optimal classifiers. In fact, the optimal 
projection vector v opt in E can be obtained by applying 
conventional FDA to the column vectors {e{x { )} . To 
derive the objective function of KFDA, let us define 



d = - 



d + + d_ 



-(d + l + -d_l_), 



(3) 



where d + ■ 



N_ 
NN. 



and d_ 



NN 



-; 1+ and l a€ « con- 
tain l's inentrfe's "corresponding to Classes C+ and Q € «, 
respectively, and O's otherwise; and N+ and N_ are the 
number of training samples in classes C + and C, respec- 
tively. It can be shown that the objective function of 
KFDA is: 



/kfdaM = 



v T S £ h v 



v T Edd T E T v 



v T S^v v r E{I-^-)E r v' 



(4) 



where 1 is an Af-dim vector with all elements equal to 
1 and sf =Edd T E T and = E(I 



■y b - nun r, ciiiM. o vv - ■ .[ . - 1 are 
between-class and within-class covariance matrices in E 
space, respectively. 

Perturbational discriminant analysis (PDA) 

The FDA and KFDA are based on the assumption that 
the observed data are perfectly measured. It is however 
crucial to take into account the inevitable perturbation 
of training data. For the purpose of designing practical 
classifiers, we can adopt the following perturbational 
discriminant analysis (PDA). 

It is assumed that the observed data is contaminated 
by additive white noise in the spectral space. Denote the 
center-adjusted matrix of E as £ and the uncorrelated 
noise as N, then the perturbed scattered matrix is 



(E + N)(E + N) T -£E T +pI 



= E 



ll 1 

N 



E T + pi 



where p is a parameter representing the noise level. Its 
value can sometimes be empirically estimated if the 
domain knowledge is well established a priori. Under 
the perturbation analysis, the kernel Fisher score in Eq. 
4 is modified to the following perturbed variant: 



/pdaOO = 



v T Edd T E T v 



v T [E(l-^)E^ P l]v' 



(5) 



By taking the derivative of /pdaW with respect to V, 
the optimal solution to Eq. 5 can be obtained as: 



"opt = 



E I- 



11 



T \ 



N 



E T + pi 



Ed, 



and using the Sherman-Morrison- Woodbury identity 
it can be shown that [41] 



!/„,„= (EE T + pJ) _1 B(d-»7l) 

= (a + piy l E{d-t]i) 



(6) 



where 77 is a scalar whose value can be determined 
through the optimal solution in K space as follows. 

Recall from Eq. 2 that dot-products in the three 
spaces are equivalent. Therefore, the discriminant func- 
tion in K space can be written as: 



f{x) = a T k{x) + b. 



(7) 



Given the optimal solution v opt in the E space, the 
corresponding optimal solution in the K space is 1 



a opt ~ E v opt 



= U T A 2(A + pI) _1 Akj(d-77l) 
= {K + piy\d-ril), 



(8) 



where we have used K - U T AU and £ _ a ijj . Note 
that unlike Eq. 6, Eq. 8 does not require spectral decom- 
position, thus offering a fast close-form solution. Now 
using the orthogonal hyperplanes principle, we have 



a 0 T pt l = (d T -7 ? l T )(K + p/)- 1 l = 0 



d T {K + P iyh 



(9) 



Note that unlike Eq. 6, Eq. 8 does not require spectral 
decomposition, thus offering a fast close-form solution. 
Also, Eq. 6 suggests that p has more regularization effect 
on the minor components with small eigenvalues than 
on the major components with large eigenvalues. This 
serves well the purpose of regularization. Consequently, 
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a PDA classifier will use less proportion of minor (and 
risky) components and more of major components. 
Therefore, the parameter p plays two major roles: (1) it 
can assure the Mercer condition and invertibility of the 
kernel matrix; and (2) it can suppress the weights 
assigned to the risker and less resilient components. 

The remaining unknown is the bias b. Recall from Eq. 
2 that dot-products in the three spaces are equivalent. 
Therefore, the discriminant function in K space can be 
written as: 

f{x) = al pt k{x) + b. (10) 
Putting all training data x t into Eq. 10, we have 

Yi = a optK x i) + h ' i = l/.../N 
=> Y = Ka o P t + bl 

where y t - 1 when x t e C+ and y t - -1 when x t e C_. 
Since K is invertible, we have a opt = IC 1 (y-bl). Eqs. 6 
and 8 suggest that perturbation in the spectral space 
can be represented by shifting the diagonal of K by p. 
Therefore, taking the perturbation in the spectral space 
into account, we have 

a opt = (K + pi)' 1 (y-bl). (11) 

Note that the solutions given in Eq. 8 and Eq. 11 are 
equivalent. Now, b can be determined by using the 
orthogonal hyperplane principle to maximize the inter- 
class separability: 

a T opt l = 0^(y T -bl T )(K + pI)-h = 0 

^ b _ y T {K + P iyh d2) 
\ T {K + piyh' 

Note that the solutions of a and b in Eqs. 11 and 12 
are equivalent to the least-squares SVM [42], although 
the way to derive the solutions are different. 
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